Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters

Neural Information Processing Systems

Motivated by the success of ensembles for uncertainty estimation in supervised learning, we take a renewed look at how ensembles of $Q$-functions can be leveraged as the primary source of pessimism for offline reinforcement learning (RL). We begin by identifying a critical flaw in a popular algorithmic choice used by many ensemble-based RL algorithms, namely the use of shared pessimistic target values when computing each ensemble member's Bellman error. Through theoretical analyses and construction of examples in toy MDPs, we demonstrate that shared pessimistic targets can paradoxically lead to value estimates that are effectively optimistic. Given this result, we propose MSG, a practical offline RL algorithm that trains an ensemble of $Q$-functions with independently computed targets based on completely separate networks, and optimizes a policy with respect to the lower confidence bound of predicted action values. Our experiments on the popular D4RL and RL Unplugged offline RL benchmarks demonstrate that on challenging domains such as antmazes, MSG with deep ensembles surpasses well-tuned state-of-the-art methods by a wide margin. Additionally, through ablations on benchmark domains, we verify the critical significance of using independently trained $Q$-functions and study the role of ensemble size. Finally, as using separate networks per ensemble member can become computationally costly with larger neural network architectures, we investigate whether efficient ensemble approximations developed for supervised learning can be similarly effective, and demonstrate that they do not match the performance and robustness of MSG with separate networks, highlighting the need for further work on efficient uncertainty estimation directed at RL.
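
To make the described scheme concrete, below is a minimal PyTorch sketch of the core idea; the network sizes, ensemble size, and pessimism coefficient beta are illustrative assumptions, not the authors' implementation. Each ensemble member bootstraps only from its own target network, and pessimism enters solely through the lower confidence bound used to train the policy.

```python
# Sketch of MSG's core idea (assumed details: network sizes, beta, ensemble
# size; this is not the authors' code).
import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def make_ensemble(obs_dim, act_dim, n=8):
    q_nets = [QNet(obs_dim, act_dim) for _ in range(n)]
    targets = [copy.deepcopy(q) for q in q_nets]  # one target net per member
    return q_nets, targets

def critic_loss(q_nets, targets, batch, policy, gamma=0.99):
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        next_act = policy(next_obs)
    total = 0.0
    for q, q_targ in zip(q_nets, targets):
        with torch.no_grad():
            # Independent targets: member i bootstraps only from its own
            # target network; no shared pessimistic aggregate appears here.
            y = rew + gamma * (1.0 - done) * q_targ(next_obs, next_act)
        total = total + ((q(obs, act) - y) ** 2).mean()
    return total

def lcb(q_nets, obs, act, beta=4.0):
    # Pessimistic value for the policy update: ensemble mean minus
    # beta times the ensemble standard deviation.
    qs = torch.stack([q(obs, act) for q in q_nets])
    return qs.mean(0) - beta * qs.std(0)
```

The policy would then be updated to maximize this lower confidence bound on dataset states (offline methods typically add a behavioral regularizer as well). The key contrast with the shared-target variant criticized above is that no min or lower-confidence-bound aggregate over the ensemble ever appears inside the Bellman targets.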


Review for NeurIPS paper: Deep Variational Instance Segmentation

Neural Information Processing Systems

Weaknesses: The authors often call their method a "one-step approach" and criticize other methods for using "heuristic postprocessing". I don't think the authors should be making these comments, as there is a separate network to classify the masks. And I don't really buy the justification at the end of Page 6 that the proposed method needs to verify fewer masks than a typical two-stage method (e.g., Mask R-CNN) and is thus a "one-stage" method. Rather, I think the authors could highlight more strongly the fact that their method does not require any anchors or region proposals, as I think this is a strong argument to be making. I also think the paper could benefit from an analysis of the number of instances the network can predict. That is, if the network was trained with a maximum of K instances in the training set, can it correctly predict more than K instances at test time?


Decoupling Value and Policy for Generalization in Reinforcement Learning

Raileanu, Roberta, Fergus, Rob

arXiv.org Artificial Intelligence

Standard deep reinforcement learning algorithms use a shared representation for the policy and value function. However, we argue that more information is needed to accurately estimate the value function than to learn the optimal policy. Consequently, the use of a shared representation for the policy and value function can lead to overfitting. To alleviate this problem, we propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic. First, IDAAC decouples the optimization of the policy and value function, using separate networks to model them. Second, it introduces an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment. IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors. Moreover, IDAAC learns representations, value predictions, and policies that are more robust to aesthetic changes in the observations that do not change the underlying state of the environment.
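
As a minimal sketch of the decoupling component alone (omitting IDAAC's advantage head and its auxiliary invariance loss), the following illustrates fully separate policy and value networks in PyTorch; the architecture details are assumptions, not the authors' implementation.

```python
# Sketch of decoupled policy/value networks (assumed architecture; IDAAC's
# invariance loss and advantage head are omitted for brevity).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim))

class DecoupledActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # Two independent encoders: gradients from the value loss never
        # touch the policy's representation, and vice versa.
        self.policy_net = mlp(obs_dim, n_actions)
        self.value_net = mlp(obs_dim, 1)

    def forward(self, obs):
        logits = self.policy_net(obs)        # parameterizes the policy
        value = self.value_net(obs).squeeze(-1)  # trained on returns
        return logits, value
```

Because the two encoders share no parameters, value-function gradients cannot shape the policy's representation, which is precisely the overfitting pathway the abstract attributes to shared representations.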


Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?

Nachum, Ofir, Tang, Haoran, Lu, Xingyu, Gu, Shixiang, Lee, Honglak, Levine, Sergey

arXiv.org Artificial Intelligence

Hierarchical reinforcement learning has demonstrated significant success at solving difficult reinforcement learning (RL) tasks. Previous works have motivated the use of hierarchy by appealing to a number of intuitive benefits, including learning over temporally extended transitions, exploring over temporally extended periods, and training and exploring in a more semantically meaningful action space, among others. However, in fully observed, Markovian settings, it is not immediately clear why hierarchical RL should provide benefits over standard "shallow" RL architectures. In this work, we isolate and evaluate the claimed benefits of hierarchical RL on a suite of tasks encompassing locomotion, navigation, and manipulation. Surprisingly, we find that most of the observed benefits of hierarchy can be attributed to improved exploration, as opposed to easier policy learning or imposed hierarchical structures. Given this insight, we present exploration techniques inspired by hierarchy that achieve performance competitive with hierarchical RL while at the same time being much simpler to use and implement.


Learning Objectives for Treatment Effect Estimation

Nie, Xinkun, Wager, Stefan

arXiv.org Machine Learning

We develop a general class of two-step algorithms for heterogeneous treatment effect estimation in observational studies. We first estimate marginal effects and treatment propensities to form an objective function that isolates the heterogeneous treatment effects, and then optimize the learned objective. This approach has several advantages over existing methods. From a practical perspective, our method is very flexible and easy to use: In both steps, we can use any method of our choice, e.g., penalized regression, a deep net, or boosting; moreover, these methods can be fine-tuned by cross-validating on the learned objective. Meanwhile, in the case of penalized kernel regression, we show that our method has a quasi-oracle property, whereby even if our pilot estimates for marginal effects and treatment propensities are not particularly accurate, we achieve the same regret bounds as an oracle who has a priori knowledge of these nuisance components. We implement variants of our method based on both penalized regression and convolutional neural networks, and find promising performance relative to existing baselines.
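
To make the two-step recipe concrete, here is a minimal scikit-learn sketch of a penalized-regression variant; the choice of learners, the fixed alpha, and the omission of cross-fitting (which the full method would use for the nuisance estimates) are simplifying assumptions, not the authors' implementation. Step 2 minimizes the residual-on-residual objective sum_i ((Y_i - m(X_i)) - (W_i - e(X_i)) * tau(X_i))^2, rewritten here as a weighted regression.

```python
# Sketch of the two-step learned objective (illustrative learners/settings;
# not the authors' implementation, and no cross-fitting for brevity).
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, LogisticRegressionCV

def two_step_cate(X, W, Y):
    # Step 1: pilot estimates of the marginal outcome m(x) ~ E[Y | X = x]
    # and the treatment propensity e(x) ~ P(W = 1 | X = x).
    m_hat = LassoCV().fit(X, Y).predict(X)
    e_hat = LogisticRegressionCV().fit(X, W).predict_proba(X)[:, 1]
    # Step 2: minimize sum_i ((Y_i - m_hat_i) - (W_i - e_hat_i) * tau(X_i))^2.
    # Algebraically this equals a regression of the residual ratio onto X
    # with weights (W_i - e_hat_i)^2, so any weighted learner can play tau.
    pseudo_outcome = (Y - m_hat) / (W - e_hat)
    weights = (W - e_hat) ** 2
    return Lasso(alpha=0.01).fit(X, pseudo_outcome, sample_weight=weights)

# Hypothetical usage on synthetic data where the true effect is 1 + X[:, 1]:
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
W = rng.binomial(1, 0.5, size=2000)
Y = X[:, 0] + W * (1 + X[:, 1]) + rng.normal(size=2000)
tau_hat = two_step_cate(X, W, Y).predict(X)
```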


Robots Podcast #233: Geometric Methods in Computer Vision, with Kostas Daniilidis

Robohub

In this episode, Jack Rasiel speaks with Kostas Daniilidis, Professor of Computer and Information Science at the University of Pennsylvania, about new developments in computer vision and robotics. Daniilidis' research team is pioneering new approaches to understanding the 3D structure of the world from simple and ubiquitous 2D images. They are also investigating how these techniques can be used to improve robots' ability to understand and manipulate objects in their environment. Daniilidis puts this in the context of current trends in robot learning and perception, and speculates on how it will help bring more robots from the lab to the "real world". How does bleeding-edge research become a viable product? Daniilidis speaks to this from personal experience, as an advisor to startups spun out of the GRASP Lab and Penn's Pennovation incubator. Kostas Daniilidis is the Ruth Yalom Stone Professor of Computer and Information Science at the University of Pennsylvania, where he has been on the faculty since 1998.